# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Tue Dec 7 11:18:53 2021"
I’m really excited to start this course. I expect to learn a lot about using R and RStudio.
Here is a link to my GitHub repository: https://github.com/Kuukkelipyy/IODS-project
date()
## [1] "Tue Dec 7 11:18:53 2021"
The second week in short:
The original data was collected in 2014/2015 from students participating in the Introduction to Social Statistics course. For more information about the data and its variables, see the data description.
# read data from local file
learning2014 <- read.table("data/learning2014.csv")
# examine the structure and dimensions of the data
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 7
After data wrangling, the data set utilized here consists of 166 observations of 7 variables (i.e. 166 rows and 7 columns), meaning that the data provides information from 166 respondents on seven variables. Deep, stra and surf are combination variables based on multiple questions measuring deep, strategic and surface learning (for more information about the variables see here). My code for the preparation of the data can be found in my repository. Next, I’ll present an overview of the data and its variables.
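As a rough illustration of what a combination variable means, the sketch below averages a few made-up Likert-type items with rowMeans(); the item names and values are hypothetical, not the actual survey questions:

```r
# Hypothetical sketch of forming a combination variable such as 'deep':
# the mean of several Likert-scale items (item names q1-q3 are made up)
answers <- data.frame(q1 = c(4, 3), q2 = c(3, 3), q3 = c(4, 2))
deep <- rowMeans(answers)   # one mean score per respondent
deep                        # about 3.67 and 2.67
```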
# summary of the variables in the data
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :14.00 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333
## Mode :character Median :22.00 Median :32.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
# show summary of the gender variable when it is transformed to factor
summary(as.factor(learning2014$gender))
## F M
## 110 56
# access to libraries ggplot2 and GGally
## if not installed yet, use: install.packages("name_of_package")
library(GGally)
library(ggplot2)
# scatter plot matrix using the ggplot2 and GGally packages
ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
## note: if the scatter plot matrix is drawn using the base R function pairs(), it seems that the gender variable needs to be transformed to a factor (in the DataCamp exercise gender is a factor, not a character)
# pairs(learning2014[-1], col = as.factor(learning2014$gender))
Above, the summary tables show a summary of each variable in the data. Gender is a categorical (‘character’) variable, while the others are continuous (‘numerical’). For the numerical variables the summary provides their minimum, maximum, mean and median values as well as the lower (1st) and upper (3rd) quartiles of their distributions. As the gender variable is a ‘character’, the summary shows only the total number of observations. When the variable is transformed to a factor, the summary function shows the numbers of female and male respondents.
The graphical presentation of the distributions of all the variables can be seen in the scatter plot matrix, which also provides statistics on the correlations between the variables. Here I focus only on the last column of the matrix, which shows the correlations between ‘points’ and the other continuous variables: the highest correlation coefficient is found between points and attitude (0.437), followed by stra (0.146) and surf (-0.144), while the correlation coefficients between points and age (-0.093) and between points and deep (-0.010) are very low. According to the scatter plot matrix, only the correlation between points and attitude is statistically significant (p < 0.001). It is also worth noticing that there are statistically significant correlations between surf and attitude, surf and deep, and surf and stra, indicating a potential problem of multicollinearity if those variables were used simultaneously as explanatory variables in a linear regression model.
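The pairwise correlations and their significance shown by ggpairs() can also be computed one pair at a time with cor.test(); a small sketch on simulated data (not the course data):

```r
# cor.test() gives a correlation coefficient together with its p-value,
# similar to the figures in the ggpairs matrix; simulated data, not the course data
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)   # y constructed to correlate positively with x
ct <- cor.test(x, y)
ct$estimate           # sample correlation coefficient
ct$p.value            # significance of the correlation
```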
In the following linear regression model ‘points’ is the response variable and, on the grounds of the correlation coefficients, attitude, stra and surf are selected as explanatory variables.
model1 <- lm(points ~ surf + stra + attitude, data = learning2014)
summary(model1)
##
## Call:
## lm(formula = points ~ surf + stra + attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## surf -0.58607 0.80138 -0.731 0.46563
## stra 0.85313 0.54159 1.575 0.11716
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The results of the multiple linear regression model are shown in the summary table above. The F-statistic with its low p-value (p < 0.001) indicates that not all the regression coefficients in the model are zero, meaning that at least one of the variables included in the model has an actual statistical effect on the target variable (points). The multiple R-squared illustrates how well the model fits the data. Here, the multiple R-squared (0.2074) shows that the model (i.e. the explanatory variables) explains around 21 % of the variation in the response variable. The summary of the residuals (i.e. the differences between observed and predicted response values) provides information about the symmetry of the residual distribution. Residuals are discussed further in chapter 2.4 along with the model diagnostics.
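The multiple R-squared quoted above is defined as 1 - SS_res / SS_tot; a quick sketch on simulated data (not the course data) confirms that this matches what summary() reports:

```r
# R-squared equals 1 - SS_res / SS_tot for an OLS model with an intercept;
# simulated data, not the course data
set.seed(42)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)
ss_res <- sum(resid(fit)^2)        # residual sum of squares
ss_tot <- sum((y - mean(y))^2)     # total sum of squares
manual_r2 <- 1 - ss_res / ss_tot
all.equal(manual_r2, summary(fit)$r.squared)  # TRUE
```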
The regression coefficient describes the relationship between an explanatory variable and the response variable; it gives the change in the response variable when the explanatory variable changes by one unit and the other variables stay constant. According to the coefficient table, the estimates of surf and stra are not statistically significant. In contrast, the estimate of attitude is statistically significant with a very low p-value (p < 0.001), showing that when attitude increases by one unit, points increase by 0.3395.
Based on the results of the multiple linear regression model presented above, I modify the model by excluding the variables whose regression coefficients are not statistically significant (p > 0.05). Namely, the variables surf and stra are dropped and only attitude is kept in the model.
model2 <- lm(points ~ attitude, data = learning2014)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The results from the simple linear regression analysis are similar to the results from the multiple linear regression model discussed above. The multiple R-squared decreases a little, showing that the new model explains about 19 % of the variation in the response variable. The value of the multiple R-squared typically decreases when variables are removed from the model (and increases when more variables are included), while the adjusted R-squared accounts for the number of explanatory variables in the model. The adjusted R-squared is practically the same in both models, around 19 %.
The coefficient table includes two estimates: the intercept and attitude. The estimate of the intercept term can be interpreted here as the expected value of the response variable (points) when the explanatory variable (attitude) is 0. This can be used, for instance, to predict points from known values of the explanatory variable. For example, if a person’s attitude is 25, we can predict their points as follows: 11.63715 + 0.35255 x 25 = 20.4509 (i.e. intercept + regression coefficient of attitude x value of attitude = predicted points).
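The worked calculation above can be reproduced directly in R; the sketch below simply plugs the reported estimates into the regression equation (the coefficient values are copied by hand from the model2 summary):

```r
# Predicted points for attitude = 25, using the estimates reported
# in the model2 summary above (values copied by hand)
b0 <- 11.63715            # intercept estimate
b1 <- 0.35255             # regression coefficient of attitude
points_hat <- b0 + b1 * 25
points_hat                # 20.4509
```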
The estimate of attitude describes the relationship between attitude and points; the statistically significant (p < 0.001) regression coefficient of attitude (0.3526) shows that when attitude increases by one, the points increase by 0.3526 points.
The key assumptions of the linear regression model examined here are that the relationship between the explanatory and response variables is linear, and that the errors are normally distributed, have constant variance and do not depend on the explanatory variables.
The validity of these assumptions can be explored by analyzing the residuals. The figure below consists of three diagnostic plots: 1) ‘Residuals vs. Fitted’, 2) ‘Normal Q-Q’ and 3) ‘Residuals vs. Leverage’.
par(mfrow = c(2, 2)) # to put the following figures in one 2x2 figure
plot(model2, which = c(1,2, 5))
In the ‘Residuals vs. Fitted’ plot the residuals are plotted against the fitted values of the response variable. The plot provides a graphical method for examining whether the errors have constant variance. A pattern in the scatter plot would indicate a problem with this assumption and imply that the size of the errors depends on the explanatory variables. Here, the residuals seem to be reasonably randomly spread above and below the line, suggesting that the assumptions of linearity and constant variance hold.
The normal Q-Q plot of the residuals helps to explore whether the errors are normally distributed. In an ideal case the residuals would match the diagonal line perfectly, and in practice the better the points follow the line, the stronger the evidence for the normality assumption. In the Q-Q plot above, the residuals fit the line reasonably well, although a slight curve is visible as the lower and upper tails deviate slightly from the diagonal line.
The ‘Residuals vs. Leverage’ plot describes how large an influence a single observation has on the model and thus helps to identify whether some observations have an unusually strong impact on determining the regression line. Observations falling outside the Cook’s distance line (the dotted red line) are considered highly influential on the model fit (as they have a large residual and high leverage), meaning that the results would change considerably if the observation were removed from the data. In the figure above the Cook’s distance line does not even appear in the plot, indicating that none of the observations has an unusually large impact on the results.
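The Cook’s distances behind the ‘Residuals vs. Leverage’ plot can also be inspected numerically with cooks.distance(); a toy sketch on simulated data, not the course model:

```r
# cooks.distance() returns the influence measure behind the
# 'Residuals vs. Leverage' plot; simulated data, not the course model
set.seed(7)
x <- rnorm(30)
y <- x + rnorm(30)
fit <- lm(y ~ x)
cd <- cooks.distance(fit)   # one influence value per observation
which.max(cd)               # index of the most influential observation
max(cd)                     # its Cook's distance
```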
date()
## [1] "Tue Dec 7 11:18:58 2021"
My third week in short
# access to all packages needed
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(ggpubr)
library(boot)
# read the prepared data set from local file
alc <- read.table("data/alc.csv", sep = ";")
# examine the data
dim(alc)
## [1] 370 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
In this exercise the data is drawn from the Student Performance Data Set, which includes information on student achievement in secondary education at two Portuguese schools. After data wrangling, the data set used here consists of 370 observations (rows) of 35 variables (columns), that is, information from 370 respondents on 35 variables. The names of the variables included in the prepared data set are shown above, and a more detailed description of them is presented in the description of the Student Performance Data.
After examining the variables in the data I chose the following four variables for further examination as I assume they could be associated with the level of alcohol consumption:
My hypotheses are:
Below I present the summary tables as well as plots of each variable of interest.
# before examining the variables I'll transform the two character variables of interest to factors
alc <- mutate(alc, sex = as.factor(sex), romantic = as.factor(romantic))
# pick the names of the variables of interest
varnames <- select(alc, sex, romantic, studytime, goout, high_use) %>%
colnames()
# summary of each variable
select(alc, all_of(varnames)) %>%
summary()
## sex romantic studytime goout high_use
## F:195 no :251 Min. :1.000 Min. :1.000 Mode :logical
## M:175 yes:119 1st Qu.:1.000 1st Qu.:2.000 FALSE:259
## Median :2.000 Median :3.000 TRUE :111
## Mean :2.043 Mean :3.116
## 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :4.000 Max. :5.000
# a plot of each variable
gather(alc[varnames]) %>% ggplot(aes(value)) +
facet_wrap("key", scales = "free") +
geom_bar()
Next, I examine how the selected variables are related to alcohol consumption. First, all variables of interest are cross-tabulated with the ‘high use of alcohol’ variable, after which the figure shows the proportional distribution of high users for each variable.
# cross-tabulations with high_use:
# sex and high_use
addmargins(table(alc$sex, alc$high_use))
##
## FALSE TRUE Sum
## F 154 41 195
## M 105 70 175
## Sum 259 111 370
# romantic and high_use
addmargins(table(alc$romantic, alc$high_use))
##
## FALSE TRUE Sum
## no 173 78 251
## yes 86 33 119
## Sum 259 111 370
# high_use and studytime
addmargins(table(alc$high_use, alc$studytime))
##
## 1 2 3 4 Sum
## FALSE 56 128 52 23 259
## TRUE 42 57 8 4 111
## Sum 98 185 60 27 370
# high_use and goout
addmargins(table(alc$high_use, alc$goout))
##
## 1 2 3 4 5 Sum
## FALSE 19 82 97 40 21 259
## TRUE 3 15 23 38 32 111
## Sum 22 97 120 78 53 370
# proportion figures
t1 <- ggplot(data = alc, aes(x = sex, fill = high_use)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
ylab("%") +
xlab("Gender")
t2 <- ggplot(alc, aes(romantic, fill = high_use)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
ylab("%") +
xlab("Romantic relationship")
t3 <- ggplot(alc, aes(x = goout, fill = high_use)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
ylab("%") +
xlab("Go out with friends")
t4 <- ggplot(alc, aes(x = studytime, fill = high_use)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
ylab("%") +
xlab("Studytime")
# all plots in one figure ('ggarrange' is from 'ggpubr' package)
ggarrange(t1 + rremove("legend"), t2 + rremove("legend"), t3 + rremove("legend"), t4 + rremove("legend"),
ncol = 2, nrow = 2,
common.legend = TRUE, legend = "right")
The tables and figure above indicate that a bigger share of males than females are high users of alcohol. Similarly, the more the students go out with friends, or the less time they use for studying, the bigger the share of high users. In contrast, the share of high users of alcohol is about the same among those who are in a romantic relationship and those who are not. Thus, these descriptive results support hypotheses 1, 3 and 4 presented above, while hypothesis 2 appears questionable.
# logistic regression model
## studytime is used as a factor in the model
model1 <- glm(high_use ~ sex + romantic + goout + as.factor(studytime), data = alc, family = "binomial")
# summary of the results
summary(model1)
##
## Call:
## glm(formula = high_use ~ sex + romantic + goout + as.factor(studytime),
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7282 -0.7953 -0.4800 0.8780 2.7249
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1839 0.5126 -6.212 5.25e-10 ***
## sexM 0.7096 0.2667 2.660 0.00781 **
## romanticyes -0.1061 0.2747 -0.386 0.69922
## goout 0.7426 0.1198 6.202 5.59e-10 ***
## as.factor(studytime)2 -0.4045 0.2950 -1.371 0.17033
## as.factor(studytime)3 -1.1405 0.4726 -2.413 0.01581 *
## as.factor(studytime)4 -1.3334 0.6245 -2.135 0.03276 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 379.20 on 363 degrees of freedom
## AIC: 393.2
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(model1) %>%
exp
# compute confidence intervals (CI) for the odds ratios
CI <- confint(model1) %>%
exp
## Waiting for profiling to be done...
# combine the odds ratios and their confidence intervals
OR_with_CI <- cbind(OR, CI)
# round the results
round(OR_with_CI, digits = 2)
## OR 2.5 % 97.5 %
## (Intercept) 0.04 0.01 0.11
## sexM 2.03 1.21 3.45
## romanticyes 0.90 0.52 1.53
## goout 2.10 1.67 2.68
## as.factor(studytime)2 0.67 0.37 1.19
## as.factor(studytime)3 0.32 0.12 0.78
## as.factor(studytime)4 0.26 0.07 0.83
The results from the logistic regression model are presented above. The null deviance is the value for a model without any explanatory variables, and the residual deviance is the value when all explanatory variables are taken into account. Higher values indicate a worse fit and smaller values a better fit. The difference between the null deviance and the residual deviance illustrates how well the model with explanatory variables fits; the greater the difference, the better. Although the residual deviance is quite high, the difference shows that the model with explanatory variables fits better than a model including only the intercept term. The AIC (Akaike’s information criterion) can be used for comparing competing models (this will be discussed more below).
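The difference between the null and residual deviance can also be assessed formally as a likelihood-ratio statistic; the sketch below plugs in the deviances and degrees of freedom reported in the summary above:

```r
# Likelihood-ratio check based on the deviances reported in the model summary
null_dev  <- 452.04   # deviance of the intercept-only model
resid_dev <- 379.20   # deviance of the fitted model
df_diff   <- 369 - 363
lr_stat   <- null_dev - resid_dev   # 72.84
p_value   <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
p_value   # far below 0.001, so the explanatory variables improve the fit
```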
The coefficients illustrate the association between the explanatory variables and the target variable. The estimates can also be used, for instance, for calculating the predicted probability of being a high user based on known values of the explanatory variables. According to the coefficient table, having a romantic relationship is not significantly associated with the use of alcohol. Instead, the results show that gender, time used for studying and going out with friends are statistically significant predictors of high alcohol consumption. However, not all of the associations point in the same direction, which will be discussed along with the odds ratios.
The odds ratio (OR) illustrates the difference in the odds of being a high user between groups. For instance, in the case of the sex variable females are the reference group (OR = 1.00) and the odds of males are compared to the odds of females; the odds ratio shows that males (OR = 2.03) are about twice as likely to be high users as females. Going out with friends more often increases the likelihood of being a high user (OR = 2.10; note that the actual scale of the variable is somewhat obscure and thus difficult to interpret). Finally, compared to students who study less than two hours, those who study 5 to 10 hours (OR = 0.32) are about three times less likely to be high users, and those who study over 10 hours (OR = 0.26) are almost four times less likely. Instead, those who study 2 to 5 hours are not statistically more or less likely to be high users than the reference group (the coefficient is not statistically significant and the CI of the OR includes 1).
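The odds ratios in the table are simply the exponentiated log-odds coefficients from the model summary; for example:

```r
# Odds ratios are the exponentiated coefficients from the summary table above
exp(0.7096)    # sexM: about 2.03
exp(0.7426)    # goout: about 2.10
exp(-1.1405)   # studytime 5-10 h: about 0.32, i.e. roughly a third of the reference odds
```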
According to the above-presented model, all variables except ‘romantic’ had a statistically significant relationship with high/low alcohol consumption. Next, we build a new regression model without the romantic variable and examine how accurate the model predictions are.
# new logistic regression model
model2 <- glm(high_use ~ sex + as.factor(studytime) + goout, data = alc, family = "binomial")
# predicted probabilities and prediction (> 0.5) of high_use and add it to the data.frame
probabilities <- predict(model2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the observed high use versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 234 25
## TRUE 58 53
# probability table of high use vs. predictions
prob_table <- table(high_use = alc$high_use, prediction = alc$prediction) %>%
prop.table %>%
addmargins
round(prob_table * 100, digits = 2) # probs in % and rounded
## prediction
## high_use FALSE TRUE Sum
## FALSE 63.24 6.76 70.00
## TRUE 15.68 14.32 30.00
## Sum 78.92 21.08 100.00
ggplot(alc, aes(x = probability, y = high_use, col = prediction)) +
geom_point(alpha = 0.5, size = 3)
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Calculate the share of false predictions (false positives + false negatives)
## this can be seen from the above prob.table as well
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2243243
The first table provides a cross-tabulation of the predictions against the actual values of high_use, and the next table shows the proportion of each cell. According to the proportion table, about 77.6 % of the predictions (cells FALSE/FALSE + TRUE/TRUE) match the actual values of the observations, while 22.4 % of the predictions (cells TRUE/FALSE + FALSE/TRUE) are incorrect. This is also confirmed by the self-defined loss function, which gives the share of false predictions, 22.4 %. The plot should visualize the results, although it seems there is something wrong with the figure (but I can’t find an error in the code).
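The percentages quoted above follow directly from the counts in the confusion matrix:

```r
# Shares of correct and incorrect predictions, computed from the counts
# in the cross-tabulation above
correct <- 234 + 53   # FALSE/FALSE + TRUE/TRUE
wrong   <- 25 + 58    # false positives + false negatives
total   <- 370
correct / total       # about 0.776
wrong / total         # about 0.224, the same share the loss function gives
```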
# compute the average number of wrong predictions in the (training) data
## loss func is defined in previous chunk
nwrong_train <- loss_func(class = alc$high_use, prob = alc$probability)
# mean error in in the training data
nwrong_train
## [1] 0.2243243
# 10-fold cross-validation
crossvalid <- cv.glm(data = alc, cost = loss_func, glmfit = model2, K = 10)
# the mean prediction error for the testing data
crossvalid$delta[1]
## [1] 0.2243243
The mean error in the training data is 0.224. The mean prediction error for the testing data is around 0.22 in the 10-fold cross-validation. The mean prediction error of the DataCamp model was about 0.26, suggesting that the model presented here has slightly better test performance; i.e. the model is more accurate in predicting high consumption of alcohol.
Here I utilize backward elimination with the AIC (Akaike’s information criterion) for selecting the explanatory variables for the model. The AIC takes into account both the statistical goodness of fit and the number of variables in the model by imposing a penalty for a greater number of variables. Among competing models lower AIC values are preferred; i.e. the aim is to reach as low an AIC value as possible by excluding, one at a time, the variables that make the value higher. The elimination ends when removing more variables would not improve the AIC score. I start by selecting 10 variables and proceed with the elimination from there.
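As a sanity check on what the AIC measures: for ungrouped binary data the glm deviance equals -2 times the log-likelihood, so AIC = deviance + 2k, where k is the number of estimated parameters. Checking this against the values reported for the first logistic model above:

```r
# AIC = deviance + 2k for a binary-response glm;
# values below are copied from the first logistic model's summary
resid_dev <- 379.20
k <- 7                 # intercept + 6 estimated coefficients
aic <- resid_dev + 2 * k
aic                    # 393.2, matching the reported AIC
```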
# logistic regression model
model3 <- glm(high_use ~ sex + age + Pstatus + absences + failures + schoolsup + as.factor(studytime) + goout + activities + freetime, data = alc, family = "binomial")
step(model3, direction = "backward")
## Start: AIC=386.87
## high_use ~ sex + age + Pstatus + absences + failures + schoolsup +
## as.factor(studytime) + goout + activities + freetime
##
## Df Deviance AIC
## - schoolsup 1 360.87 384.87
## - Pstatus 1 360.89 384.89
## - age 1 361.00 385.00
## - as.factor(studytime) 3 365.07 385.07
## - freetime 1 361.34 385.34
## <none> 360.87 386.87
## - failures 1 363.14 387.14
## - activities 1 363.79 387.79
## - sex 1 370.47 394.47
## - absences 1 372.93 396.93
## - goout 1 392.92 416.92
##
## Step: AIC=384.87
## high_use ~ sex + age + Pstatus + absences + failures + as.factor(studytime) +
## goout + activities + freetime
##
## Df Deviance AIC
## - Pstatus 1 360.89 382.89
## - age 1 361.00 383.00
## - as.factor(studytime) 3 365.08 383.08
## - freetime 1 361.34 383.34
## <none> 360.87 384.87
## - failures 1 363.15 385.15
## - activities 1 363.80 385.80
## - sex 1 370.75 392.75
## - absences 1 372.93 394.93
## - goout 1 392.92 414.92
##
## Step: AIC=382.89
## high_use ~ sex + age + absences + failures + as.factor(studytime) +
## goout + activities + freetime
##
## Df Deviance AIC
## - age 1 361.03 381.03
## - as.factor(studytime) 3 365.09 381.09
## - freetime 1 361.38 381.38
## <none> 360.89 382.89
## - failures 1 363.18 383.18
## - activities 1 363.80 383.80
## - sex 1 370.77 390.77
## - absences 1 372.99 392.99
## - goout 1 392.99 412.99
##
## Step: AIC=381.03
## high_use ~ sex + absences + failures + as.factor(studytime) +
## goout + activities + freetime
##
## Df Deviance AIC
## - as.factor(studytime) 3 365.16 379.16
## - freetime 1 361.50 379.50
## <none> 361.03 381.03
## - failures 1 363.56 381.56
## - activities 1 364.01 382.01
## - sex 1 371.03 389.03
## - absences 1 373.55 391.55
## - goout 1 394.24 412.24
##
## Step: AIC=379.16
## high_use ~ sex + absences + failures + goout + activities + freetime
##
## Df Deviance AIC
## - freetime 1 365.69 377.69
## <none> 365.16 379.16
## - failures 1 369.03 381.03
## - activities 1 369.14 381.14
## - absences 1 379.72 391.72
## - sex 1 380.24 392.24
## - goout 1 398.78 410.78
##
## Step: AIC=377.69
## high_use ~ sex + absences + failures + goout + activities
##
## Df Deviance AIC
## <none> 365.69 377.69
## - activities 1 369.50 379.50
## - failures 1 369.59 379.59
## - absences 1 380.01 390.01
## - sex 1 382.11 392.11
## - goout 1 404.79 414.79
##
## Call: glm(formula = high_use ~ sex + absences + failures + goout +
## activities, family = "binomial", data = alc)
##
## Coefficients:
## (Intercept) sexM absences failures goout
## -3.99781 1.05729 0.08345 0.45660 0.72059
## activitiesyes
## -0.51287
##
## Degrees of Freedom: 369 Total (i.e. Null); 364 Residual
## Null Deviance: 452
## Residual Deviance: 365.7 AIC: 377.7
The backward elimination suggests that we should keep five of the original 10 variables: sex, failures, activities, absences and goout. Interestingly, studytime was excluded! Next, I run the logistic regression model with those variables and conduct a 10-fold cross-validation.
model4 <- glm(high_use ~ sex + failures + activities + absences + goout,
data = alc, family = "binomial")
summary(model4)
##
## Call:
## glm(formula = high_use ~ sex + failures + activities + absences +
## goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2685 -0.7612 -0.5209 0.7648 2.4960
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.99781 0.48451 -8.251 < 2e-16 ***
## sexM 1.05729 0.26683 3.962 7.42e-05 ***
## failures 0.45660 0.23337 1.957 0.05040 .
## activitiesyes -0.51287 0.26475 -1.937 0.05272 .
## absences 0.08345 0.02258 3.695 0.00022 ***
## goout 0.72059 0.12300 5.858 4.67e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 365.69 on 364 degrees of freedom
## AIC: 377.69
##
## Number of Fisher Scoring iterations: 4
# 10-fold cross-validation
crossvalid2 <- cv.glm(data = alc, cost = loss_func, glmfit = model4, K = 10)
# the mean prediction error for the testing data
crossvalid2$delta[1]
## [1] 0.1918919
# visual
probabilities_m4 <- predict(model4, type = "response")
alc <- mutate(alc, probability_m4 = probabilities_m4)
alc <- mutate(alc, prediction_m4 = probability_m4 > 0.5)
ggplot(alc, aes(x = probability_m4, y = high_use, col = prediction_m4)) +
geom_point(alpha = 0.5, size = 3)
The table above presents the summary of the fourth model. According to the 10-fold cross-validation, the mean prediction error for the testing data is around 0.19, suggesting that this model has better test performance than the model examined above. The figure provides graphical evidence for this. Moreover, the smaller residual deviance and the bigger difference between the null and residual deviance suggest that this new model is better than the previous one.
Thanks for reading and for the upcoming feedback!
date()
## [1] "Tue Dec 7 11:19:02 2021"
My fourth week in short
# access to packages
library(MASS)
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
# get the data (included in MASS)
data(Boston)
# structure of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston data set utilized here includes ‘Housing Values in Suburbs of Boston’, that is, different variables relating to housing in the city of Boston. The data consists of 506 observations of 14 numeric (or integer) variables (i.e. 506 rows and 14 columns). A description of the variables can be seen in the description of the data set.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# let's use gather from the tidyr package; gather returns key-value pairs of variables
# draw a bar plot of each variable
gather(Boston) %>%
ggplot(aes(value)) +
facet_wrap("key", scales = "free") +
geom_histogram()
# construct a correlation matrix and round the results
cor_matrix <-cor(Boston) %>%
round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "r", tl.pos = "d", tl.cex = 0.8)
Summaries of the variables and the histograms presented above illustrate that many of the variables in the data set have quite skewed distributions and that the variables are measured on quite different scales.
The correlation matrix shows that many of the variables are highly correlated. The correlation coefficients vary from -1 to 1. A positive coefficient indicates that high values of variable ‘X’ are associated with high values of variable ‘Y’. Respectively, a negative coefficient indicates that high values of ‘X’ are associated with low values of ‘Y’.
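As a numeric complement to the circle plot, one could also list the strongest pairwise correlations directly; a minimal sketch, assuming the Boston data from the MASS package as loaded earlier:

```r
library(MASS)
data(Boston)
# compute the correlation matrix and blank out the lower triangle and the
# diagonal so that each variable pair appears only once
cors <- cor(Boston)
cors[lower.tri(cors, diag = TRUE)] <- NA
# turn the matrix into a (Var1, Var2, Freq) data frame and sort by absolute value
pairs_df <- subset(as.data.frame(as.table(cors)), !is.na(Freq))
pairs_df[order(-abs(pairs_df$Freq)), ][1:5, ]
```

This prints the five variable pairs with the largest correlations in absolute value, which should match the biggest circles in the plot.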
First, I’ll scale the Boston data, that is, standardize each variable by subtracting its mean and dividing by its standard deviation, and save the standardized variables as a data.frame instead of a matrix. When variables are centered, their mean is adjusted to zero, as can be seen from the summaries of the scaled variables shown below.
After scaling the data, I will create a new factor variable from the ‘crime rate per capita’ variable, using its quantiles as cut points. A summary of the new ‘crime’ variable can be found below.
Finally, I will split the scaled data into training and testing data sets, which are then used in the next subchapter.
# scale the Boston data and transform the matrix to data.frame
boston_scaled <- Boston %>%
scale() %>%
as.data.frame()
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# pick the quantiles of crim
bins <- quantile(boston_scaled$crim)
# create label names
crim_lab <- c("low", "med_low", "med_high", "high")
# create new factor variable
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = crim_lab)
# add the new variable to the data frame and remove the old one
boston_scaled$crime <- crime
# NB: MASS also defines select(); without the dplyr:: prefix, knitting the index file may fail because the wrong package's select() is used!
boston_scaled <- dplyr::select(boston_scaled, -crim)
## alternative way: boston_scaled <- subset(boston_scaled, select = -crim)
# summary of the new variable
table(boston_scaled$crime)
##
## low med_low med_high high
## 127 126 126 127
# pick the number of total observations in the data
n <- nrow(boston_scaled)
# take a sample of 80% observations; i.e. pick randomly row numbers between 1 and 0.8 x rows
train_indexes <- sample(n, size = n * 0.8)
# create the training set by using the defined indexes for rows
train_set <- boston_scaled[train_indexes,]
# create the testing set by excluding the rows used for the training set
test_set <- boston_scaled[-train_indexes,]
Regarding this part of the exercise, there were multiple problems with the DataCamp exercises and the video was not available. In any case, let’s fit the LDA, using the ‘crime’ variable as the target variable and all the other variables as predictors. Below I show the results in a table and in a biplot. No interpretation was asked for in the instructions, so none is provided.
# fit the linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train_set)
# print the LDA results
lda.fit
## Call:
## lda(crime ~ ., data = train_set)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2351485 0.2623762 0.2524752 0.2500000
##
## Group means:
## zn indus chas nox rm age
## low 0.9798349 -0.9465893 -0.10655643 -0.8697221 0.4748522 -0.8901232
## med_low -0.1009415 -0.2776530 0.02481057 -0.5613252 -0.1559298 -0.3546641
## med_high -0.3771048 0.1505504 0.22945822 0.3759123 0.1447012 0.4577764
## high -0.4872402 1.0171306 -0.07742312 1.0388899 -0.2820864 0.7970479
## dis rad tax ptratio black lstat
## low 0.9133668 -0.6832698 -0.7490169 -0.45400774 0.37357575 -0.77398342
## med_low 0.3525735 -0.5539047 -0.5011011 -0.06966301 0.34518698 -0.10236402
## med_high -0.3976240 -0.4020080 -0.3223349 -0.35441922 0.06931202 0.02977471
## high -0.8346100 1.6379981 1.5139626 0.78062517 -0.69924847 0.78844064
## medv
## low 0.547178415
## med_low -0.007567466
## med_high 0.188948814
## high -0.656484559
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08131354 0.721876461 -0.887831317
## indus 0.03488808 -0.231172656 0.329080278
## chas -0.07288796 -0.094772345 0.072200440
## nox 0.36335569 -0.527164780 -1.268031611
## rm -0.10383783 -0.002966584 -0.158721649
## age 0.22616820 -0.405097437 -0.288889781
## dis -0.09506915 -0.183586453 0.117985138
## rad 3.18871780 0.839155198 0.001764394
## tax -0.01887380 -0.006328868 0.475442490
## ptratio 0.10189358 0.104115865 -0.259738669
## black -0.14125769 0.021807400 0.171770896
## lstat 0.21307343 -0.161061421 0.352529976
## medv 0.18262875 -0.302282089 -0.266611987
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9480 0.0377 0.0143
# draw LDA biplot
classes <- as.numeric(train_set$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
Next, I will save the observed frequencies of the crime categories from the test data and then remove the variable, after which I test how well the fitted model predicts the values. The results are shown below. The classifier, that is the fitted model, did not predict the crime rates perfectly, although the predictions are somewhat accurate. Most of the predictions match the observed classes, and those that do not match fall mostly into the adjacent classes (i.e. the predictions are not right but close).
# save the crime variable separately
correct_cases <- test_set$crime
# remove crime variable from the test data
## MASS has also select() -> dplyr::select
test_set <- dplyr::select(test_set, -crime)
# use LDA model for predicting crime cases
lda.predict <- predict(lda.fit, newdata = test_set)
# cross tabulate the correct cases with the predictions
table(correct = correct_cases, predicted = lda.predict$class)
## predicted
## correct low med_low med_high high
## low 18 12 2 0
## med_low 3 15 2 0
## med_high 1 7 15 1
## high 0 0 0 26
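The cross tabulation can be condensed into a single accuracy figure; a small sketch, assuming the `correct_cases` and `lda.predict` objects from the chunks above:

```r
# share of test observations whose predicted class matches the observed one
conf <- table(correct = correct_cases, predicted = lda.predict$class)
sum(diag(conf)) / sum(conf)
```

The diagonal of the confusion table holds the correctly classified cases, so dividing its sum by the total gives the overall classification accuracy.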
Here, I reload the Boston data and recreate a data set in which the variables are standardized so that the distances are comparable. Next, I calculate the Euclidean distances between the observations. A summary of the calculated distances is shown below.
# clean the environment (i.e. data.frames and variables etc.)
rm(list = ls())
# reload the Boston dataset
data(Boston)
# scale the Boston data and transform the matrix to data.frame
boston_scaled <- Boston %>%
scale() %>%
as.data.frame()
dist_euc <- dist(boston_scaled, method = "euclidean")
summary(dist_euc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
After that, K-means clustering is conducted: first with three centers and then, after examining the within-cluster sum of squares (WCSS), with the two centers suggested by the WCSS investigation. Below, the results of the two-center clustering are visualized in scatter plot matrices in which the clusters are colored; the first plot shows all of the variables (which is quite unreadable) and the following three plots show parts of the big picture; in all of them the two more or less distinct clusters are visible.
# k-means clustering with 3 centers
km_boston <- kmeans(boston_scaled, centers = 3)
# examination of WCSS, that is, deciding the optimal number of centers
# set seed
set.seed(123)
# determine the number of clusters in the examination
k_max <- 10
# calculate the total WCSS
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# the examination of WCSS and the visualization suggest that two centers would be optimal
km_boston <- kmeans(boston_scaled, centers = 2)
# visualize the k-means clustering results
pairs(boston_scaled, col = km_boston$cluster)
pairs(boston_scaled[1:5], col = km_boston$cluster)
pairs(boston_scaled[6:10], col = km_boston$cluster)
pairs(boston_scaled[11:14], col = km_boston$cluster)
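The visual impression of two clusters can also be checked numerically; a minimal sketch, assuming the `km_boston` object fitted above with two centers:

```r
# cluster sizes and the (scaled) cluster centers; large absolute center
# values indicate which variables separate the clusters most clearly
km_boston$size
round(km_boston$centers, 2)
```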
THAT’S ALL!
date()
## [1] "Tue Dec 7 11:19:13 2021"
library(dplyr)
library(ggplot2)
library(GGally)
library(corrplot)
library(tidyr)
The data set used here is drawn from the Human Development Index and the Gender Development Index. More information can be found on the United Nations Development Programme’s webpage. The prepared data set examined here consists of 155 observations (that is, countries) of 8 variables.
Description of the variables in the prepared data:
The structure of the data and summaries of the variables are shown below.
# read the data from a local file
human <- read.table("data/human.csv", sep = ";")
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ GNIperCap : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ LifeExp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ ExpEdu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ MatMortalityRate : int 4 6 6 5 6 7 9 28 11 8 ...
## $ AdoBirthRate : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ FemalesParliament: num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## $ LabourRatio : num 0.891 0.819 0.825 0.884 0.829 ...
## $ SecondEduRatio : num 1.007 0.997 0.983 0.989 0.969 ...
summary(human)
## GNIperCap LifeExp ExpEdu MatMortalityRate
## Min. : 581 Min. :49.00 Min. : 5.40 Min. : 1.0
## 1st Qu.: 4198 1st Qu.:66.30 1st Qu.:11.25 1st Qu.: 11.5
## Median : 12040 Median :74.20 Median :13.50 Median : 49.0
## Mean : 17628 Mean :71.65 Mean :13.18 Mean : 149.1
## 3rd Qu.: 24512 3rd Qu.:77.25 3rd Qu.:15.20 3rd Qu.: 190.0
## Max. :123124 Max. :83.50 Max. :20.20 Max. :1100.0
## AdoBirthRate FemalesParliament LabourRatio SecondEduRatio
## Min. : 0.60 Min. : 0.00 Min. :0.1857 Min. :0.1717
## 1st Qu.: 12.65 1st Qu.:12.40 1st Qu.:0.5984 1st Qu.:0.7264
## Median : 33.60 Median :19.30 Median :0.7535 Median :0.9375
## Mean : 47.16 Mean :20.91 Mean :0.7074 Mean :0.8529
## 3rd Qu.: 71.95 3rd Qu.:27.95 3rd Qu.:0.8535 3rd Qu.:0.9968
## Max. :204.80 Max. :57.50 Max. :1.0380 Max. :1.4967
The graphical overview of the data is presented below. The scatter plot matrix shows the distributions of the variables and illustrates their relationships. A better overview of the relationships between the variables is provided by the correlation matrix, which shows that many of the variables are highly correlated with each other. The correlation coefficients vary from -1 to 1. A positive coefficient indicates that high values of variable ‘X’ are associated with high values of variable ‘Y’. Respectively, a negative coefficient indicates that high values of ‘X’ are associated with low values of ‘Y’.
# scatter plot matrix using ggplot2 and GGally -packages
ggpairs(human, lower = list(combo = wrap("facethist", bins = 20)))
# correlation matrix
cor(human) %>%
corrplot(method = "circle", type = "upper", cl.pos = "r", tl.cex = 0.8)
Next, I will perform principal component analysis (PCA) on the human data: first with the unstandardized data and after that with the standardized data set.
# PCA with SVD method (unstandardized data)
pca_human_ustd <- prcomp(human)
# summary of PCA results
smry <- summary(pca_human_ustd)
smry
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(smry$importance[2, ] * 100, digits = 2)
# create object pc_lab (variance captured) to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# biplot of the results
biplot(pca_human_ustd,
choices = 1:2,
cex = c(0.7, 0.9),
col = c("black", "red"),
xlab = pc_lab[1], ylab = pc_lab[2])
The results from the PCA performed with the unstandardized data are shown above in the summary table and graphically in the biplot. They clearly show that the first principal component captures (misleadingly) almost all of the variance (99.99%), which is due to the use of unstandardized data. PCA is sensitive to the relative scaling of the data and assumes that variables with larger variance are more important than those with smaller variance. That is not the case here, and thus the data must be standardized before performing PCA.
The results from the PCA of the standardized data are presented below. Now the first principal component captures about 54 percent of the variance and the second component about 16 percent. In the biplot, the arrows represent the original variables: the direction of an arrow shows which principal components the variable is related to, the angles between arrows reflect the correlations between the variables, and the length of an arrow is proportional to the variable’s standard deviation.
From the figure one can notice, for instance, that GNI per capita, life expectancy, expected education and the gender ratio in secondary education are all highly ‘positively’ correlated with each other. Respectively, maternal mortality is positively correlated with the adolescent birth rate, and these two are negatively correlated with the variables mentioned in the previous sentence (e.g. maternal mortality is negatively correlated with life expectancy). Moreover, the figure illustrates that all of these features contribute to the first principal component. In contrast, the variables FemalesParliament and LabourRatio contribute to the second principal component, and the vertical arrows illustrate that the share of women in parliament correlates positively with the labour-force ratio between the genders.
# standardize the human data and save it as data frame
human_std <- as.data.frame(scale(human))
# PCA with SVD method (standardized data)
pca_human_std <- prcomp(human_std)
# summary of PCA results
smry_std <- summary(pca_human_std)
# rounded percentages of variance captured by each PC
pca_pr_std <- round(smry_std$importance[2, ] * 100, digits = 2)
# create object pc_lab (variance captured) to be used as axis labels
pc_lab_std <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")
# biplot of the results
biplot(pca_human_std,
choices = 1:2,
cex = c(0.5, 0.7),
col = c("black", "red"),
xlab = pc_lab_std[1], ylab = pc_lab_std[2])
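The arrows in the biplot correspond to the variable loadings; a quick way to inspect them numerically (assuming the `pca_human_std` object fitted above):

```r
# loadings (rotation matrix) of the first two principal components;
# variables with large absolute values on PC1 drive the first component
round(pca_human_std$rotation[, 1:2], 2)
```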
The ‘tea’ data used here consists of 300 observations of 36 variables. I see no reason for visualizing the whole data set, so I just don’t do it. Instead, I pick a few of the variables. A summary and visualization of the subset are shown below.
# load the data set from FactoMineR-package
library(FactoMineR)
data("tea")
# explore the data
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# select a few columns and create a new data set
keep_vars <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_vars))
# structure and summary of the new dataset
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualization of the new data set
gather(tea_time) %>% ggplot(aes(value)) +
facet_wrap("key", scales = "free") +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Next, I will perform Multiple Correspondence Analysis (MCA). Visualizations of the MCA results are shown below.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the MCA model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# plot the MCA results
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
In the figure, the variable categories are drawn on the first two dimensions. The distance between the categories provides a measure of their similarity. The vertical dimension relates, at least to some extent, to where the tea is bought. Respectively, the horizontal dimension seems to be related to how the tea is consumed (e.g. tea bags, unpackaged etc.). From the figure one can interpret, for example, that buying tea from a tea shop is closer to using unpackaged tea than to using tea bags, and similarly getting tea from a chain store is closer to using tea bags than to using unpackaged tea.
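This interpretation can be cross-checked against the eta-squared values reported by FactoMineR; a small sketch, assuming the `mca` object fitted earlier:

```r
# squared correlations between each variable and the first two MCA dimensions;
# values close to 1 indicate a strong link to the dimension
round(mca$var$eta2[, 1:2], 2)
```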
date()
## [1] "Tue Dec 7 11:19:23 2021"
# packages in use
library(dplyr)
library(ggplot2)
library(tidyr)
In this subchapter I will implement the analyses of Chapter 8 of MABS using the RATS data.
The RATS data utilized here is drawn from a nutrition study conducted on rats (see Crowder & Hand 1990). In the experiment, the rats were divided into three groups with different diets, and each rat’s weight (grams) was recorded weekly over a 9-week period (with the exception of week seven, when two measurements were taken).
The original RATS data included 16 observations of 13 variables, that is, 16 rats with multiple measurements over time (and a few other variables). Here, the data is transformed from wide to long format by time (measurement), meaning that each time point (i.e. measurement) forms an observation and, thus, every individual rat has as many observations as there are measurements. Therefore, in the long format there are 176 observations of 4 variables, which are:
NB: The variable ‘WD’ is removed from the data set as its information is already included in the ‘Time’ variable and thus there is no use for it. In addition, the ID and Group variables are transformed into factors.
# load the data in long format
rats_long <- read.table("data/rats_long.csv", sep = ";")
# prepare data for the analysis
## remove WD variable
## Group and ID -> factors
rats_long <- rats_long %>%
dplyr::select(-WD) %>%
mutate(Group = as.factor(Group),
ID = as.factor(ID))
str(rats_long)
## 'data.frame': 176 obs. of 4 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
head(rats_long)
## ID Group Weight Time
## 1 1 1 240 1
## 2 2 1 225 1
## 3 3 1 245 1
## 4 4 1 260 1
## 5 5 1 255 1
## 6 6 1 260 1
glimpse(rats_long)
## Rows: 176
## Columns: 4
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
First, I draw a plot for every observed rat; the plot shows the measured weight (y-axis) over time (x-axis) for every rat (lines). Rats in every group have gained weight during the experiment, although the growth seems to be bigger in groups 2 and 3 than in group 1. However, the plot also illustrates that at the beginning of the experiment (time 1) the rats already differ from each other in weight, and bigger rats remain bigger throughout the experiment. Rats in group 1 are much smaller than those in groups 2 and 3. Also, there is one much bigger individual in group 2.
# Draw a plot for rats in each group over time
ggplot(rats_long, aes(x = Time, y = Weight, linetype = ID, col = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:8, times=4)) + # recycle line types 1-8 across the rat IDs
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(rats_long$Weight), max(rats_long$Weight))) +
ggtitle("Individual weight profiles over time by group for the RATS data") +
xlab("Time (days)")
Next, I will standardize the weight variable within each time point and repeat the graphical examination presented above. The graphical results are similar, although the trend in weight gain does not seem to be as straightforward as in the previous plot, especially in group 3.
# Standardize the variable weight
rats_long<- rats_long %>%
group_by(Time) %>%
mutate(Weight_std = (Weight - mean(Weight)) / sd(Weight)) %>%
ungroup()
# Draw the plot using the standardized weight variable
ggplot(rats_long, aes(x = Time, y = Weight_std, linetype = ID, col = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:8, times=4)) + # recycle line types 1-8 across the rat IDs
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(name = "Standardized weight") +
ggtitle("Individual weight profiles over time by group for the RATS data after standardization") +
xlab("Time (days)")
To get a better picture of the weight gain and the differences between the groups, I will next produce a graph showing the average profiles for each group. Additionally, I draw a graph showing side-by-side box plots of the observations at each time point. The graphs illustrate that every group has gained weight over time, although the weight gain is somewhat different across the groups. The figures also show that the mean weights of groups 2 and 3 clearly differ from group 1. In addition, although the profiles of groups 2 and 3 overlap somewhat, it seems that they do differ from each other. Moreover, the box plots suggest the presence of multiple possible outliers at a number of time points.
# Summary data with mean and standard error of weight by treatment group and time
# Note: DIFFERENT NUMBER OF RATS IN DIFFERENT GROUPS!
## use function n() inside sqrt(); n() gives the current group size
rats_long_s <- rats_long %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = sd(Weight) / sqrt(n())) %>%
ungroup()
# Plot the mean profiles
ggplot(rats_long_s , aes(x = Time, y = mean, linetype = Group, shape = Group, col = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.45)) +
scale_y_continuous(name = "mean(weight) +/- se(weight)") +
xlab("Time (days)") +
ggtitle("Mean response profiles for the three groups of rats in the RATS data")
# Alternative mean profile figure: side-by-side box plots of the observations at each time point
ggplot(rats_long, aes(x = as.factor(Time), y = Weight, fill = Group)) +
geom_boxplot(position = position_dodge(width = 0.9)) +
theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
theme(legend.position = c(0.9,0.4)) +
scale_x_discrete(name = "Time (days)") +
ggtitle("Boxplots for the RATS data: weight over time")
I will continue with the summary measure approach and look into the post-treatment values of weight. The mean weight (over all weeks) will be the summary measure. Day 1 is considered the baseline here and is excluded from the examination.
# Create a summary data by group and subject with mean as the summary variable (ignoring baseline day 1).
rats_summary <- rats_long %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Draw a boxplot of the mean versus group
ggplot(rats_summary, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), days 8-64") +
ggtitle("Boxplots of mean summary measures for the three groups in RATS data")
The box plots presented above show again that the groups differ from each other and also that every group has an outlier, although only in group 2 is the outlier far from the others. Thus, I will remove the outlier of group 2 and draw the box plots again. After removing the outlier, the differences between the groups are even clearer, as can be seen in the figure below.
# remove the group 2 outlier (mean weight above 550 g)
rats_summary_filt <- rats_summary %>%
filter(mean < 550)
# Draw a boxplot of the mean versus group
ggplot(rats_summary_filt, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), days 8-64") +
ggtitle("Boxplots of mean summary measures for the three groups in RATS data \n(without outlier)")
Next, I will first fit a linear model with the mean weight as the response and group as the only explanatory variable. The results of the model are shown below.
# Fit the linear model with the mean as the response and Group as explanatory
fit <- lm(mean ~ Group, data = rats_summary)
# summary of the fitted model
summary(fit)
##
## Call:
## lm(formula = mean ~ Group, data = rats_summary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.900 -27.169 2.325 9.069 106.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 265.02 12.98 20.419 2.92e-11 ***
## Group2 222.77 22.48 9.909 2.00e-07 ***
## Group3 262.48 22.48 11.675 2.90e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.71 on 13 degrees of freedom
## Multiple R-squared: 0.9316, Adjusted R-squared: 0.9211
## F-statistic: 88.52 on 2 and 13 DF, p-value: 2.679e-08
# analysis of variance table for the fitted model
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 238620 119310 88.525 2.679e-08 ***
## Residuals 13 17521 1348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the summary table above shows, the positive regression coefficients of groups 2 and 3 are statistically significant (p < 0.001). Similarly, the analysis of variance (ANOVA) shows that the weight of the rats differs between the groups and that the result is highly significant (p < 0.001). Next, I will add the baseline measurement (day 1) from the original data to the summary data and include it as a covariate in the linear regression model. The results of the linear regression model and ANOVA are presented below.
## load the data in wide form
rats_wide <- read.table("data/rats_wide.csv", sep = ";")
# Add the baseline (Day 1) from the original data as a new variable to the summary data
rats_summary2 <- rats_summary %>%
mutate(baseline = rats_wide$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = rats_summary2)
# summary of the fitted model
summary(fit)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = rats_summary2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.905 -4.194 2.190 7.577 14.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.16375 21.87657 1.516 0.1554
## baseline 0.92513 0.08572 10.793 1.56e-07 ***
## Group2 34.85753 18.82308 1.852 0.0888 .
## Group3 23.67526 23.25324 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared: 0.9936, Adjusted R-squared: 0.992
## F-statistic: 622.1 on 3 and 12 DF, p-value: 1.989e-13
# analysis of variance table for the fitted model
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Taking the baseline into account changes the results considerably! According to the regression results above, the coefficients of groups 2 and 3 are no longer statistically significant (using p < 0.05 as the threshold). The analysis of variance gives a similar picture, indicating that the difference between the groups is not statistically significant once the baseline is controlled for. Naturally, these are only preliminary results; for instance, the longitudinal structure of the data is still ignored. A more thorough and proper analysis of the data can be found in Chapter 9 of MABS.
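As a side note, the "not significant" reading can be made concrete with confidence intervals: if a group coefficient's 95% interval contains zero, the data are consistent with no group difference. On the actual baseline-adjusted model one could simply call `confint(fit)`; below is only a minimal sketch on simulated data (the toy variables `y` and `g` are mine, not from the course data):

```r
# sketch: 95% confidence intervals for coefficients of a linear model
# with a three-level grouping factor (toy data, for illustration only)
set.seed(1)
toy <- data.frame(y = rnorm(30), g = gl(3, 10, labels = c("1", "2", "3")))
ci <- confint(lm(y ~ g, data = toy))  # one row per coefficient, columns 2.5% / 97.5%
ci
```

If the rows for the group dummies straddle zero, the conclusion matches the non-significant p-values reported above.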
In this subchapter I will implement the analyses of Chapter 9 of MABS using the BPRS data.
In this exercise I will use data taken from Davis (2002). The data come from an experimental study in which 40 males were randomly assigned to two treatment groups, and each participant was rated on a psychiatric rating scale (BPRS) before the treatment started (week 0) and then weekly for eight weeks. Here, the data have already been transformed from wide to long form.
# load the data in long format
bprs_long <- read.table("data/bprs_long.csv", sep = ";")
# prepare data for the analysis
## remove the redundant weeks variable
## treatment, subject and id -> factors
bprs_long <- bprs_long%>%
dplyr::select(-weeks) %>%
mutate(treatment = as.factor(treatment),
subject = as.factor(subject),
id = as.factor(id))
str(bprs_long)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ id : Factor w/ 40 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
summary(bprs_long)
## treatment subject id bprs week
## 1:180 1 : 18 1 : 9 Min. :18.00 Min. :0
## 2:180 2 : 18 2 : 9 1st Qu.:27.00 1st Qu.:2
## 3 : 18 3 : 9 Median :35.00 Median :4
## 4 : 18 4 : 9 Mean :37.66 Mean :4
## 5 : 18 5 : 9 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 6 : 9 Max. :95.00 Max. :8
## (Other):252 (Other):306
head(bprs_long)
## treatment subject id bprs week
## 1 1 1 1 42 0
## 2 1 2 2 58 0
## 3 1 3 3 54 0
## 4 1 4 4 55 0
## 5 1 5 5 72 0
## 6 1 6 6 48 0
The BPRS data includes multiple measurements from 40 subjects. As the data has been transformed from wide to long format by time (measurement), each time point forms an observation, so every participant has as many observations as there are measurements. Note that the subjects/participants are numbered from 1 to 20 in both treatment groups, which is why I have created a new id variable running from 1 to 40 that identifies each participant uniquely (see the R script for the coding of the id variable). In the long form there are 360 observations of 5 variables, which are: treatment (treatment group, 1 or 2), subject (subject number within a group, 1-20), id (unique participant identifier, 1-40), bprs (the BPRS score) and week (measurement week, 0-8).
NB: The variable ‘weeks’ is removed from the data set, as its information is already contained in the ‘week’ variable and there is thus no use for it. In addition, the treatment, subject and id variables are transformed into factors.
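For illustration, the unique id could be coded for example as follows. This is only a sketch under the assumption that subjects are numbered 1-20 within each of the two treatment groups and each has 9 measurements (weeks 0-8); the actual coding is in the data wrangling script in my repository:

```r
# assumed structure: 2 treatment groups x 20 subjects x 9 measurements = 360 rows
treatment <- rep(1:2, each = 20 * 9)
subject   <- rep(rep(1:20, each = 9), times = 2)
# unique id 1-40: offset the second group's subject numbers by 20
id <- subject + 20 * (treatment - 1)
range(id)  # 1 40
```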
Let’s first take a brief graphical look at the data! I’ll begin by plotting the BPRS ratings over time, identifying the treatment group each observation belongs to while ignoring the longitudinal nature of the data (i.e. which subject’s observations are in question). The plot below shows a downward trend in the BPRS scores, especially in the first treatment group.
ggplot(bprs_long, aes(x = week, y = bprs, group = id, col = treatment)) +
geom_text(aes(label = treatment)) +
scale_x_continuous(name = "Time (week)", breaks = seq(0, 11, 1)) +
scale_colour_discrete(labels = c("Group 1", "Group 2")) +
scale_y_continuous(name = "BPRS") +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
ggtitle("Plot of BPRS-scores over time.\nThe treatment groups are identified by numbers and colors")
Next, I will take the longitudinal nature of the data into account and draw a plot of individual BPRS profiles. The individual profiles show a similar downward trend in the BPRS scores over time, although in group 2 there seem to be more departures from this trend.
The next figure, a scatter plot matrix, demonstrates that the repeated measures of BPRS are not independent of one another, which is no surprise given the longitudinal nature of the data.
# Plot the BPRS data in long form
ggplot(bprs_long, aes(x = week, y = bprs, group = id, col = treatment)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 11, 1)) +
scale_y_continuous(name = "BPRS") +
theme_bw() + theme(legend.position = "top") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
ggtitle("Plot of individual BPRS profiles")
# load the data in wide form and draw a scatter plot matrix
bprs_wide <- read.table("data/bprs_wide.csv", sep = ";")
pairs(bprs_wide[, 3:11], cex = 0.7)
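The dependence visible in the scatter plot matrix can also be quantified numerically; on the actual data one could run `round(cor(bprs_wide[, 3:11]), 2)`. The simulated sketch below (the toy data are mine) shows why repeated measures on the same people correlate: they share a stable between-person level.

```r
# sketch: two repeated measures built from a shared person-level component
set.seed(42)
person <- rnorm(40, mean = 40, sd = 10)            # stable between-person level
wide <- data.frame(week0 = person + rnorm(40, sd = 3),
                   week1 = person + rnorm(40, sd = 3))
cor(wide$week0, wide$week1)  # clearly positive, like the BPRS weeks
```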
Now I’ll fit a linear regression model in which BPRS is the response variable and time and treatment are the explanatory variables. It should be noted that in this independent model the repeated-measures structure of the data is ignored. According to the results (see below) the estimate of time is statistically significant (p < 0.001), meaning that the BPRS scores decrease over time conditional on treatment group. However, the regression coefficient of treatment group is not significant, indicating that the BPRS scores do not differ between the groups conditional on time.
# fit the independent linear model (repeated-measures structure ignored)
bprs_m1 <- lm(bprs ~ week + treatment, data = bprs_long)
summary(bprs_m1)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = bprs_long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The independent model fitted above assumes independence of the repeated measures of the response variable (bprs), which is highly unlikely here. Thus, I will fit a random intercept model, which allows the linear regression fit for each participant to differ in intercept from the other participants.
# access the lme4 package, which provides functions for fitting mixed models
library("lme4")
# to add p-values to the summary
#install.packages("lmerTest")
library(lmerTest)
# note: in the formula, the random-effects terms are distinguished by vertical bars (|)
bprs_rim1 <- lmer(bprs ~ week + treatment + (1 | id), data = bprs_long, REML = FALSE)
# summary of the random intercept model
summary(bprs_rim1)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (1 | id)
## Data: bprs_long
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## id (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: id, 40
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 46.4539 2.3521 45.7607 19.750 <2e-16 ***
## week -2.2704 0.1503 320.0000 -15.104 <2e-16 ***
## treatment2 0.5722 3.2159 40.0000 0.178 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
The results from the random intercept model are presented above. The estimates of time (i.e. week) and treatment group are quite similar to those from the previous model. The p-value of the time estimate is statistically significant, while that of the treatment group is not. The magnitudes of the t-values point the same way: the magnitude of week’s t-value (-15.1) is quite large, while the magnitude of treatment’s t-value (0.2) is very small (see Note 1 below). The interpretation is the same as above: the results suggest that the BPRS scores decrease over time conditional on treatment.
The estimated variance of the participant (id) random effects is pretty large (97.39), which indicates considerable variation in the intercepts of the regression fits of the individual BPRS profiles. Thus, I will next fit a random intercept and random slope model, which allows the linear regression fits for each individual to differ in slope as well. This way it is possible to detect the individual differences in the participants’ BPRS profiles, and also the effect of time.
Note 1: The greater the magnitude of the t-value, the greater the evidence against the null hypothesis, that is, the stronger the evidence of a significant difference. Conversely, the closer t is to 0, the less likely it is that there is a statistically significant difference.
Note 2: Even though the significance of the results and coefficients of the mixed effects models is commented on multiple times in Chapter 9 of MABS, there are no p-values or explanations of how to judge the significance of the results from the t-values, which I found a bit annoying. Here, I installed the lmerTest package, which adds p-values to the lmer summary tables.
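The connection between t-values and p-values in Note 1 can also be checked by hand: given a t statistic and its degrees of freedom, the two-sided p-value comes straight from the t-distribution. A sketch using the values reported for week in the random intercept model above:

```r
# two-sided p-value from a t statistic and its (Satterthwaite) degrees of freedom
t_week  <- -15.104
df_week <- 320
p_week  <- 2 * pt(abs(t_week), df = df_week, lower.tail = FALSE)
p_week  # effectively zero, matching the < 2e-16 in the summary table
```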
# random intercept and random slope model
bprs_rim2 <- lmer(bprs ~ week + treatment + (week | id), data = bprs_long, REML = FALSE)
# summary of the random intercept and slope model
summary(bprs_rim2)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | id)
## Data: bprs_long
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## id (Intercept) 167.827 12.955
## week 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: id, 40
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 45.9830 2.6470 49.9491 17.372 < 2e-16 ***
## week -2.2704 0.2713 40.0001 -8.370 2.51e-10 ***
## treatment2 1.5139 3.1392 39.9998 0.482 0.632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.545
## treatment2 -0.593 0.000
# anova
anova(bprs_rim1, bprs_rim2)
## Data: bprs_long
## Models:
## bprs_rim1: bprs ~ week + treatment + (1 | id)
## bprs_rim2: bprs ~ week + treatment + (week | id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## bprs_rim1 5 2582.9 2602.3 -1286.5 2572.9
## bprs_rim2 7 2523.2 2550.4 -1254.6 2509.2 63.663 2 1.499e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results from the random intercept and random slope model are presented above. The fixed effect estimate of time (i.e. week) is the same as in the random intercept model, but the fixed effect estimate of treatment group is larger, suggesting a bigger difference in BPRS scores between the groups. However, as the summary table shows, the estimate of treatment is still not statistically significant according to the p-values, while the estimate of time is, and the t-values provide similar evidence.
The likelihood ratio test comparing the two models gives a chi-squared statistic of 63.66 with 2 degrees of freedom, and the associated p-value is very small (p < 0.001), meaning that the random intercept and slope model provides a better fit for the data.
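The likelihood ratio test can be reproduced from the reported deviances: the test statistic is the difference of the two deviances, compared to a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters (7 - 5 = 2):

```r
# manual check of the likelihood ratio test from the deviances reported above
chisq <- 2572.9 - 2509.2                       # difference in deviances, about 63.7
p_val <- pchisq(chisq, df = 2, lower.tail = FALSE)
p_val  # very small, agreeing with the anova() output
```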
Finally, I fit a random intercept and slope model that allows an interaction between treatment group and time. The results from this model are shown below. They differ somewhat from the previous model, but more importantly the likelihood ratio test of the random intercept and slope model versus the same model with the interaction term shows that the interaction does not significantly improve the fit (p = 0.18), and thus the results are not interpreted here any further.
# random intercept and random slope model with an interaction term
bprs_rim3 <- lmer(bprs ~ week * treatment + (week | id), data = bprs_long, REML = FALSE)
# summary of the model
summary(bprs_rim3)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: bprs ~ week * treatment + (week | id)
## Data: bprs_long
##
## AIC BIC logLik deviance df.resid
## 2523.5 2554.5 -1253.7 2507.5 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4747 -0.5256 -0.0866 0.4435 3.7884
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## id (Intercept) 164.204 12.814
## week 2.203 1.484 -0.66
## Residual 36.748 6.062
## Number of obs: 360, groups: id, 40
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 47.8856 2.9840 40.0010 16.047 < 2e-16 ***
## week -2.6283 0.3752 40.0021 -7.006 1.84e-08 ***
## treatment2 -2.2911 4.2200 40.0010 -0.543 0.590
## week:treatment2 0.7158 0.5306 40.0021 1.349 0.185
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.668
## treatment2 -0.707 0.473
## wek:trtmnt2 0.473 -0.707 -0.668
# anova
anova(bprs_rim2, bprs_rim3)
## Data: bprs_long
## Models:
## bprs_rim2: bprs ~ week + treatment + (week | id)
## bprs_rim3: bprs ~ week * treatment + (week | id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## bprs_rim2 7 2523.2 2550.4 -1254.6 2509.2
## bprs_rim3 8 2523.5 2554.6 -1253.7 2507.5 1.78 1 0.1821
The very last thing to do is to draw one more figure showing the observed and fitted BPRS profiles. Based on the results above I will use the random intercept and slope model without the interaction term for the fitted values. The two figures are presented below and illustrate how well the model fits the observed data. The fit seems somewhat rough, and there are quite big differences between some observed and fitted lines, but everyone can take a look and make a judgement of their own.
# create fitted data
Fitted <- fitted(bprs_rim2)
# add the fitted values to the data frame
bprs_long <- bprs_long %>%
mutate(Fitted = Fitted)
graph1 <- ggplot(bprs_long , aes(x = week, y = bprs, group = id, col = treatment)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 11, 1)) +
scale_y_continuous(name = "BPRS") +
theme_bw() + theme(legend.position = "right") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
ggtitle("Observed BPRS")
graph2 <- ggplot(bprs_long , aes(x = week, y = Fitted, group = id, col = treatment)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 11, 1)) +
scale_y_continuous(name = "BPRS") +
theme_bw() + theme(legend.position = "right") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
ggtitle("Fitted BPRS")
library(ggpubr)
ggarrange(graph1, graph2,
ncol = 2, nrow = 1,
common.legend = TRUE, legend = "right")
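To supplement the visual judgement of fit, one could also compute a single numeric summary such as the root mean squared error between observed and fitted values. A small helper sketch (the name rmse is my own; on the data above one would call rmse(bprs_long$bprs, bprs_long$Fitted)):

```r
# root mean squared error between observed and fitted values
rmse <- function(observed, fitted) sqrt(mean((observed - fitted)^2))
rmse(c(30, 40, 50), c(32, 38, 49))  # sqrt((4 + 4 + 1) / 3), about 1.73
```

The smaller the RMSE, the closer the fitted profiles track the observed ones on average.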
THAT’S
ALL FOLKS AND DOCS!
THE END.